1. INTRODUCTION

In the media, fake news refers to news that has been fabricated and is presented to readers as being
accurate. People in advanced economies are likely to see more fake content (70%) than real content. Fake
news can now be created by humans or by artificial intelligence (AI) [1]. Numerous fact-checking
tools are available, including NewsGuard and Hoaxy, and fact-checking websites such as PolitiFact,
GossipCop, and BuzzFeed [2] are still improving their ability to identify false information. Although the
quality of news content on social media is lower than that of traditional media, around 50% of Americans
got their news from social media in 2021 [3]. Revenue from traditional news media is shrinking, and online
publishers try to earn advertising revenue by attracting more clicks on their content. Distrust of facts
offered by the established media is also rising. Because news on social media spreads rapidly, is easy to
access, and costs little to disseminate, the number of fake news stories keeps increasing [4].

The goal of linguistic analysis is to look for language leakage, also called predictive linguistic
cues, to detect fake news. Recent work on automatic detection captures these predictive cues, or writing
style, using linguistic features of the fake content, e.g., lexical, syntactic, and semantic features [5], [6].
The news writing style captures the frequency of words in the content at the lexical level; the choice
between nouns and pronouns and the use of cardinal numbers (CN), adjectives, and verbs at the syntactic
level; and psycho-linguistic attributes at the semantic level. Writers of fake news tend to use language
strategically to influence human psychology.

Rashkin et al. [7] showed that stylistic language cues can help determine the truthfulness of text. The
authors compared the language of real news (from the English Gigaword corpus) with that of satire (The
Onion, The Borowitz Report, Clickhole), hoaxes (American News, DC Gazette), and propaganda (The Natural
News, Activist Report). They observed that lexicon markers such as swear words, 2nd person pronouns,
modal adverbs, action adverbs, 1st person singular pronouns, manner adverbs, gender, see, negation, strong
subjectives, hedges, superlatives, and weak subjectives were more prominent in fake news, while number,
hear, money, assertives, and comparatives were more prominent in truthful news. Fake news detection has
further been shown to depend on the stance classification of a news item [8].

Allcott and Gentzkow [9] studied news articles from the 2016 US elections. They collected 156 news
articles, of which 41 were recorded as anti-Trump and 115 as anti-Clinton. The anti-Clinton articles were
shared 30.3 million times on Facebook. Sentiment analysis, as in [10], [11], which uses the positive and
negative sentiment of the input text for news classification, seems promising. A study by Horne and Adali
[10] that related the headline to the body text of news for stance classification concluded that the headline
of a fake news item repeats the main content.

Efforts are being made to automate the process of fake news detection [12]-[16]. One such
technique is Generating aRticles by Only Viewing mEtadata Records (GROVER) [17], which generates fake
news using nucleus sampling: at each time step it samples from the most probable words whose
cumulative probability comprises the top-p% of the entire vocabulary. As a detector of such generated
news, it gives around 92% accuracy. However, when it is applied to human-written fake news, it gives 73%
accuracy. Thus, classifiers trained on language written by humans are required. Sentences created by
generative models are distinguishable from human-generated text because they show low variance and a
small vocabulary. This property is used to assess the validity of the text [18]. The success of machine
learning (ML) models depends on feature engineering, since not all features of a dataset are useful in
building a model for prediction [19]-[23]. Accurate selection of effective features is a crucial
step in applying ML algorithms. The automated approach of Maronikolakis et al. [24] applies several
recurrent neural network (RNN) models to distinguish human-written headlines from machine-generated
ones. Humans were able to identify the fake headlines in only 45% of the cases, whereas the most accurate
automatic approach in that paper, based on transfer learning, achieved an accuracy of 94%.
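
The nucleus (top-p) sampling step mentioned above for GROVER can be illustrated with a minimal sketch. It is not GROVER's implementation: the function names, the toy probability table, and the threshold of 0.75 are assumptions chosen only to show how the candidate set is truncated before sampling.

```python
import random

def nucleus(probs, top_p):
    """Return the smallest list of most probable (token, prob) pairs
    whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

def nucleus_sample(probs, top_p=0.9, rng=random):
    """Sample one token from the nucleus, weighting by the kept probabilities."""
    kept = nucleus(probs, top_p)
    tokens = [t for t, _ in kept]
    weights = [p for _, p in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-word distribution (illustrative values only).
probs = {"the": 0.5, "a": 0.3, "fake": 0.15, "zebra": 0.05}
print(nucleus(probs, top_p=0.75))         # only the most probable words survive
print(nucleus_sample(probs, top_p=0.75))  # one word drawn from that set
```

Low-probability words are never drawn, which is one way to see why such text tends to have the small vocabulary noted above.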

Tan et al. [25] presented an approach to detect the semantic incongruities present in text and
image captions generated by machines. The approach determines an authenticity score using
the co-occurrences of named entities in the text and captions. The word embeddings of the captions and
the image are projected into a common visual-semantic space built on fine-grained
interactions between words in the caption and objects in the image. A semantic similarity score is computed
for every possible pair of projected word and object features. The final authenticity score of an article is
determined across those of its images and captions. The approach was compared with the GROVER [17]
model and outperformed it. Various deep learning-based techniques are being studied to improve the
correlation [26] between features through an attention mechanism. These techniques extend the feature
space of the news content with multimodal features from audio, video, or textual representations and apply
the attention mechanism to mine the complex correlations.

In this paper, fake news detection emphasizes deep mining of the news content through
linguistic analysis and a language feature set used with ML algorithms. We propose a correlation attribute
evaluation metric, computed from the correlation between the feature set and the class, and a covariance
metric, computed from the variance of the features over the news items. The proposed feature set can
differentiate between fake and real news with high accuracy (nearly 97 ± 2% area under curve (AUC)
score) using the AdaBoost model. The main contributions of the paper are:

a) A study of a feature set comprising unique words, negative words, neutral words, positive words,
compound score, noun, adjective, adverb, preposition, and CN for fake news classification.

b) Using the Corr metric, we found a feature set that performed better than the set of all the features
considered in the study.

c) Results show that the performance of classifiers depends on the news content, i.e., the linguistic
characteristics of the news.

d) The proposed methodology works for balanced, imbalanced, and small datasets.


2. METHOD 

We used four datasets from Kaggle, BuzzFeed, PolitiFact, and the Fake News Challenge, as shown in
Table 1. The Kaggle-Guardian dataset comprises fake news from Kaggle and real news from the Guardian.
The Kaggle dataset contains text and metadata scraped from 244 different websites tagged as bullshit (BS)
by the BS Detector Chrome extension. We considered only the English-language news available in the
Kaggle dataset, which amounted to 11,439 items. To compare the linguistic features of
fake news and real news, we downloaded 9,724 news items from the Guardian using the Guardian
application programming interface (API). These items were retrieved by searching with
keywords based on terms in the Kaggle dataset.

The BuzzFeed news dataset [2] is collected from the fact-checking platform BuzzFeed.com and contains
the body text, headline, and uniform resource locator (URL) of news posted on Twitter by users.
There are 91 real and 91 fake news items propagated through 634,750 social links by 15,257 users. The
PolitiFact news dataset [2] is collected from the fact-checking platform PolitiFact.com in the same way as
the BuzzFeed dataset. There are 120 real and 120 fake news items propagated through 574,744 social links
by 37,259 users. The Fake News Challenge dataset (FNC-1) includes the body text, headline, and URL of
each news item together with its stance. The dataset categorizes news into four classes: agree, disagree,
unrelated, and discuss. We reduced the four classes to two by taking the agree class as real news and the
news with the stances disagree, unrelated, and discuss as fake news. This gives about 49,970 news items in
total, with 46,293 fake and 3,678 real, so the dataset is imbalanced with a real-to-fake ratio of roughly 1:12.
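
The reduction from four stance classes to two labels can be sketched as follows. The helper names and the toy list of stances are assumptions for illustration; only the mapping itself (agree as real, the other three stances as fake) comes from the text.

```python
from collections import Counter

# Stance classes named in the text, grouped into the two target labels.
REAL_STANCES = {"agree"}
FAKE_STANCES = {"disagree", "unrelated", "discuss"}

def to_binary_label(stance):
    """Map a four-class stance to the two-class real/fake label."""
    stance = stance.strip().lower()
    if stance in REAL_STANCES:
        return "real"
    if stance in FAKE_STANCES:
        return "fake"
    raise ValueError(f"unknown stance: {stance!r}")

def fake_per_real(labels):
    """Return how many fake items there are per real item."""
    counts = Counter(labels)
    return counts["fake"] / counts["real"]

# Hypothetical toy sample, not taken from the dataset.
stances = ["agree", "unrelated", "discuss", "unrelated", "disagree"]
labels = [to_binary_label(s) for s in stances]
print(labels, fake_per_real(labels))

# Ratio implied by the counts reported in Table 1 for this dataset.
print(round(46293 / 3678, 1))
```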


Table 1. Count of news items
            Kaggle and Guardian   BuzzFeed   PolitiFact   FakeNews
Real News   9,724                 91         120          3,678
Fake News   11,439                91         120          46,293


2.1. Feature engineering 

To create the feature set, the text of each news item was tokenized using the word_tokenize function of
the Python nltk library. All tokens that are stop-words according to the nltk corpus were removed to create
clean text. SnowballStemmer was used to reduce each word to its stem so that different forms of the same
word are treated alike. SentimentIntensityAnalyzer and pos_tag were then applied to the stems of the words
from the clean text to compute the sentiment and part-of-speech (POS) tag of each word. The frequencies
of the POS tags and sentiment categories were computed for each news item in all the datasets to create the
feature set.


2.1.1. Features set 

The feature set comprises unique words, negative words, neutral words, positive words, compound
score, noun, adjective, adverb, preposition, verb in base form (VB), verb past tense (VBD), verb in gerund or
present participle (VBG), verb in past participle (VBN), verb in 3rd person singular present (VBZ), and CN.
Unique words represent the number of distinct words in the given text. Unique words were observed
to make up 60-100% of the words in fake news, while for real news the range was 20-80%. Positive and
negative words provide a measure of the sentiment of the text in terms of intensity and polarity
[27]. For example, of the two sentences “the person is superb” and “the person is good,”
the sentence “the person is superb” is given a higher intensity by the sentiment intensity analyzer.

Valence aware dictionary for sentiment reasoning (VADER) [28], a sentiment lexicon available in
Python, was used for sentiment analysis. VADER treats acronyms and initialisms like laugh out loud (LOL),
emoticons like ;), and slang like nah as crucial for sentiment analysis. VADER provides a compound
score on an intensity scale between -1 and +1. We computed the percentage of negative, positive, and
neutral words in both real and fake news. We also considered other grammatical and linguistic features: VB
represents the verb base form (for example, take), VBD represents the verb past tense (for example, took),
VBG represents the verb gerund/present participle (for example, taking), VBN represents the verb past
participle (for example, taken), VBZ represents the verb 3rd person singular present (for example, takes),
and CN represents a cardinal number.
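
The per-item feature counts described above can be sketched in plain Python. This is an illustrative sketch, not the authors' code: the tiny stop-word list and sentiment word sets are toy stand-ins for the nltk stop-word corpus and the VADER lexicon, stemming is omitted, and the input is assumed to be a list of (word, tag) pairs with Penn Treebank tags, such as nltk's pos_tag returns.

```python
from collections import Counter

# Toy stand-ins (assumptions); the paper uses the nltk stop-word corpus and VADER.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "in", "to"}
POSITIVE = {"good", "superb", "great"}
NEGATIVE = {"bad", "terrible", "fake"}

# Penn Treebank tags grouped under the feature names used in the text.
TAG_GROUPS = {
    "noun": {"NN", "NNS", "NNP", "NNPS"},
    "adjective": {"JJ", "JJR", "JJS"},
    "adverb": {"RB", "RBR", "RBS"},
    "preposition": {"IN"},
    "CN": {"CD"},
    "VB": {"VB"}, "VBD": {"VBD"}, "VBG": {"VBG"}, "VBN": {"VBN"}, "VBZ": {"VBZ"},
}

def extract_features(tagged_tokens):
    """Compute word percentages and POS tag counts for one news item."""
    clean = [(w.lower(), t) for w, t in tagged_tokens if w.lower() not in STOP_WORDS]
    total = len(clean)
    if total == 0:
        return {}
    words = [w for w, _ in clean]
    tags = Counter(t for _, t in clean)
    n_pos = sum(w in POSITIVE for w in words)
    n_neg = sum(w in NEGATIVE for w in words)
    features = {
        "unique": 100.0 * len(set(words)) / total,      # % of distinct words
        "positive": 100.0 * n_pos / total,
        "negative": 100.0 * n_neg / total,
        "neutral": 100.0 * (total - n_pos - n_neg) / total,
    }
    for name, group in TAG_GROUPS.items():
        features[name] = sum(tags[t] for t in group)    # frequency of each tag group
    return features

# Hypothetical pre-tagged sentence.
tagged = [("The", "DT"), ("person", "NN"), ("is", "VBZ"), ("superb", "JJ"),
          ("and", "CC"), ("took", "VBD"), ("3", "CD"), ("awards", "NNS")]
features = extract_features(tagged)
print(features)
```

Because stop-words are removed before counting, the auxiliary "is" does not contribute to the VBZ count in this example.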


2.1.2. Features selection 

Selection of features is based on a correlation attribute evaluation metric and a covariance metric, which
are computed using the correlation between the feature set and the class and the correlation among the
features. The covariance of the features over the news items is defined by (1) and (2). Let
$f_1, f_2, \ldots, f_n$ represent the frequencies of the $n$ linguistic features over all $m$ news items. We
computed $\mu_{real_i}$, the mean frequency of feature $f_i$ over the $m_1$ real news items, and
$\mu_{fake_i}$, the mean frequency of feature $f_i$ over the $m_2$ fake news items. We then calculated the
covariance of each feature for each real and fake news item as shown in (1) and (2), respectively. We then
used the Corr metric to filter the appropriate features. Correlation attribute evaluation metrics are combined
to select the features, so the approach is named Corr [28].

Step 1: The correlation value shows how much one variable changes for a slight change in another
variable, and the covariance gives the direction of the linear relationship between variables. In the proposed
method, the correlation attribute evaluation metric is evaluated between feature and class ($Corr_{fc}$),
averaged over $k$ features. The correlation among the features ($Corr_{ff}$) is also evaluated, averaged over
$k$ features, and the Corr metric in (3) is calculated from the two; the features with high values are selected.
If the value for a feature is higher than the specified threshold, the feature is considered effective, and the
list of features is arranged in descending order. A strong correlation between the feature set and its class
indicates that the feature set is predictive of the class. The wrapper technique is then applied to filter the
features accurately and select effective features for the selected ML algorithms.

In this technique, the features are first sorted by their correlation values.
A threshold is then assigned, and the features whose correlation values exceed it are kept in descending
order. We observed that the features unique words, positive words, negative words, and CN have a higher
correlation with the class and with each other than the noun and adjective features. We then combine (1),
(2), and (3) to define the correlation and covariance attribute evaluation metric (CorrCov metric), which is
presented in (4) and Table 2.


Table 2. Correlation-covariance attribute evaluation metric

Attribute   $\frac{k \cdot avgCorr_{fc}}{nrCovar_{f_n} + \sqrt{k + k(k-1)\, avgCorr_{ff}}}$
A1          0.6
A2          0.4
A3          0.3

Step 2: We averaged the covariance of each feature ($nrCovar_{f_n}$) over all news items in the dataset,
as presented in (1) and (2), and after further normalization, the features were put in a list in descending
order.

Step 3: The next step is to filter the features using the AUC metric of a specific ML algorithm. The
algorithm evaluates the features one by one and selects those that give high AUC values. The ML
algorithms Naive Bayes, decision tree, random forest, k-nearest neighbor, AdaBoost, and support vector
machine (SVM) are used to evaluate the AUC metric.

Step 4: The final step is a verification phase that applies Shannon entropy (using (5)) and the technique
for order of preference by similarity to ideal solution (TOPSIS) [29], [30] to obtain the selected effective
feature set in Table 3.


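$$covar_{real}(f_i) = (f_i - \mu_{real_i}) \times (f_i - \mu_{real_i}) \qquad (1)$$

$$covar_{fake}(f_i) = (f_i - \mu_{fake_i}) \times (f_i - \mu_{fake_i}) \qquad (2)$$

$$Corr = \frac{k \cdot avgCorr_{fc}}{\sqrt{k + k(k-1)\, avgCorr_{ff}}} \qquad (3)$$

$$CorrCov = \frac{k \cdot avgCorr_{fc}}{nrCovar_{f_n} + \sqrt{k + k(k-1)\, avgCorr_{ff}}} \qquad (4)$$

$$ent = -(\ln n)^{-1} \sum_{i=1}^{n} A_i \ln(A_i) \qquad (5)$$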


Table 3. Decision matrix

Attribute  High-A1  Medium-A2  Medium-A3  Low-A1  Very High-A2  High-A3  Low-A1  Very Low-A2  Medium-A3
Writer1    0.7      0.5        0.4        0.3     0.9           0.7      0.3     0.1          0.5
Writer2    0.8      0.4        0.5        0.2     0.8           0.6      0.2     0.2          0.4
Writer3    0.6      0.4        0.5        0.1     0.8           0.7      0.1     0.3          0.5
ent        0.651    0.954      0.954      0.834   0.458         0.715    0.834   0.834        0.954
div        0.349    0.046      0.046      0.166   0.542         0.285    0.166   0.166        0.046
wgt        0.134    0.026      0.025      0.093   0.306         0.566    0.093   0.093        0.026
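
The entropy, divergence, and weight rows can be sketched as follows. The sketch applies (5) to the raw column values as the equation is written, takes the divergence as one minus the entropy, and normalizes the divergences to obtain weights. That last step is the usual entropy-weight convention and is an assumption here; it is not verified to reproduce the weight row of Table 3. The three columns used are the Low-A1, Very High-A2, and High-A3 columns of the table.

```python
import math

def shannon_entropy(column):
    """Equation (5): ent = -(1 / ln n) * sum(A_i * ln A_i) over the n column entries."""
    n = len(column)
    return -sum(a * math.log(a) for a in column if a > 0) / math.log(n)

def entropy_weights(matrix):
    """Return (ent, div, wgt) per column of a row-major decision matrix."""
    columns = list(zip(*matrix))
    ent = [shannon_entropy(c) for c in columns]
    div = [1.0 - e for e in ent]           # divergence = 1 - entropy
    total = sum(div)
    wgt = [d / total for d in div]         # assumed normalization to sum to 1
    return ent, div, wgt

# Rows are Writer1..Writer3; columns are three attribute columns from Table 3.
decision_matrix = [
    [0.3, 0.9, 0.7],
    [0.2, 0.8, 0.6],
    [0.1, 0.8, 0.7],
]
ent, div, wgt = entropy_weights(decision_matrix)
print([round(e, 3) for e in ent])
print([round(d, 3) for d in div])
print([round(w, 3) for w in wgt])
```

For the first column this gives an entropy of about 0.831, close to but not identical with the 0.834 listed in the table. The standard entropy-weight method would also rescale each column to sum to one before applying (5), which keeps the entropy within [0, 1].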

Different writers have different styles of writing news and use language attributes
(adjectives/adverbs) differently. Table 3 is the decision matrix (DM) representing possible
values of the selected features. In this research, the features $A_1$ = positive, $A_2$ = negative,
$A_3$ = unique, $A_4$ = cardinal number, and $A_5$ = variance are found to be effective. Let the features
sentiments, nouns, and adjectives take values in the ranges high = 0.6-0.8, medium = 0.4-0.5, and
low = 0.1-0.3 in various news items; then $DM = [p_{ij}]_{a \times b}$ for the different writers is shown in
Table 3, where $a$ is the number of writers of news items and $b$ is the number of features. Different
classifiers may use different informative feature selection criteria and therefore differ in classification,
with the different weight choices presented in Table 4.


Table 4. Representation for one classifier C1

Classifier  High-A1  Medium-A2  Medium-A3  Low-A1  Very High-A2  High-A3  Low-A1  Very Low-A2  Medium-A3  Wgt Choice
C1          1        0          0          0       0             0        0       0            1          0.16
C2          1        1          0          0       0             0        0       0            1          0.186
C3          1        1          1          1       0             0        0       1            1          0.397
wgt         0.134    0.026      0.025      0.093   0.306         0.566    0.093   0.093        0.026
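
The last column of Table 4 appears to be the sum of the weights of the attributes each classifier selects. The short check below, using only the values printed in the table, is offered as a reading of the table rather than as a definition given by the text; the function name is an assumption.

```python
# Weight row and 0/1 selection rows copied from Table 4.
weights = [0.134, 0.026, 0.025, 0.093, 0.306, 0.566, 0.093, 0.093, 0.026]
selections = {
    "C1": [1, 0, 0, 0, 0, 0, 0, 0, 1],
    "C2": [1, 1, 0, 0, 0, 0, 0, 0, 1],
    "C3": [1, 1, 1, 1, 0, 0, 0, 1, 1],
}

def weight_choice(selection, weights):
    """Sum the weights of the attributes marked 1 in a classifier's row."""
    return sum(w for s, w in zip(selection, weights) if s == 1)

choices = {name: round(weight_choice(row, weights), 3) for name, row in selections.items()}
print(choices)
```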


Equations (5) and (6) provide a quantization of the attributes. The quantization of different classifiers
may be used further for training over the datasets by maximization or minimization, as in Table 4. Shannon
entropy (1 - divergence) is a measure of uncertainty, and TOPSIS is a statistical method that ranks
design alternatives (Table 5). TOPSIS chooses the best solution on the basis of Euclidean distance: the
shortest distance from the positive ideal solution (PIS) and the farthest distance from the negative ideal
solution (NIS), as in Table 5. Each classifier measures the distances ($\Delta_k^+$, $\Delta_k^-$) from the PIS
and NIS. The combined separation distances can be given as $\Delta_k^+ = \sqrt{\sum_i \Delta_{ik}^+}$ and
$\Delta_k^- = \sqrt{\sum_i \Delta_{ik}^-}$, where $\Delta_{ik}^+ = (wgt_{ik} - MAX(wgt_{ik}))^2$ and
$\Delta_{ik}^- = (wgt_{ik} - MIN(wgt_{ik}))^2$. Each classifier measures the closeness of each feature to the
PIS as in (6). The features obtained using the Cov-Corr metric are listed as fSet2 in Table 6.

$$C_k = \frac{\Delta_k^-}{\Delta_k^+ + \Delta_k^-} \qquad (6)$$
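
The TOPSIS closeness computation can be sketched in its textbook form. This is an illustrative sketch under stated assumptions: all criteria are treated as benefit criteria (larger is better), the ideal and negative ideal points are the column-wise maximum and minimum, and the weighted scores are hypothetical rather than taken from Table 5.

```python
import math

def topsis_closeness(weighted):
    """Return the closeness to the ideal solution for each alternative (row)."""
    columns = list(zip(*weighted))
    ideal = [max(c) for c in columns]       # positive ideal solution (PIS)
    anti_ideal = [min(c) for c in columns]  # negative ideal solution (NIS)
    scores = []
    for row in weighted:
        d_pos = math.sqrt(sum((v - b) ** 2 for v, b in zip(row, ideal)))
        d_neg = math.sqrt(sum((v - w) ** 2 for v, w in zip(row, anti_ideal)))
        total = d_pos + d_neg
        scores.append(d_neg / total if total else 0.0)
    return scores

# Hypothetical weighted scores: three alternatives, two criteria.
weighted = [
    [0.9, 0.8],
    [0.5, 0.5],
    [0.1, 0.2],
]
scores = topsis_closeness(weighted)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

An alternative that coincides with the ideal point scores 1, one that coincides with the negative ideal point scores 0, and the ranking follows the scores in descending order.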


Table 5. Distance from ideal and negative ideal solution 


Expert PIS NIS 
Model 1 0.397 0.16 
Model 2 0.697 0 
Model 3 0.06 0.1 


Table 6. Feature sets
Feature set name  List of features
fSet1             unique, negative, neutral, positive, compound, noun, adjective, adverb,
                  preposition, VB, VBD, VBG, VBN, VBZ, CN, negativeVar, positiveVar, cnVar
fSet2             unique, negative, positive, CN, uniqueVar, negativeVar, positiveVar, cnVar


3. RESULTS AND DISCUSSION 

We used two feature sets, fSet1 and fSet2, as presented in Table 6. fSet1 comprises all the features
considered to study the fake and real news items of the four datasets, whereas fSet2 comprises the limited
set of features obtained using the Cov-Corr metric. We implemented the Naive Bayes, decision tree, random
forest, k-nearest neighbors, AdaBoost, and SVM algorithms to compare the classification results using fSet1
and fSet2. The scores are obtained by randomly splitting the datasets in the ratio 0.7:0.3 into training and
cross-validation sets. Figure 1 and Figure 2 compare the AUC and F1 scores of the algorithms on fSet1, and
Figure 3 and Figure 4 compare the AUC and F1 scores of the algorithms on fSet2.

The AUC score is computed from the precision-recall curve. We observe high AUC scores for the FNC-1
dataset, which has more fake news than real news (imbalanced). In Figures 1, 2, 3, and 4, we observe a lower
F1 score ($F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$) than AUC score in the
classifications by all the ML algorithms. We observe that positive words, negative words, unique words, and
CN are the prominent features for fake news detection from a linguistic analysis of text, since the
correlation and covariance of the features of fSet2 were higher for fake news than for real news, as reflected
in Figures 1 and 2, whereas for all the other features the correlation and covariance were similar. Figure 3
depicts the AUC score obtained using the unique words, negative words, positive words, and CN and the
variance of the features (fSet2). In fSet1, we used all the features; however, we could achieve comparable
performance with the reduced feature set.
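
The F1 formula above can be checked with a short worked example. The label vectors are hypothetical and the function name is an assumption; label 1 stands for the class treated as positive.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 = 2PR / (P + R) for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical ground truth and predictions for six news items.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

With three true positives, one false positive, and one false negative, precision, recall, and F1 all equal 0.75.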


Figure 1. Comparison of AUC score of algorithms on fSet1 with varying datasets

Figure 2. Comparison of F1 score of algorithms on fSet1 with varying datasets

Figure 3. Comparison of AUC score of algorithms on fSet2 with varying datasets


The decision tree algorithm outperformed the other algorithms except for random forest
and AdaBoost. We found that the BuzzFeed dataset is the most challenging dataset for all the ML algorithms.
The decision tree algorithm improved its AUC score on the BuzzFeed dataset from 70% with fSet1 to 83%
with fSet2. The Gaussian Naive Bayes classifier did not perform well in fake/real news identification on any
of the datasets. It uses the mean and variance of each feature individually over the dataset and then finds the
joint conditional probability of all features to determine the range of values for each class. On the FNC-1
dataset, which has the highest AUC score with all algorithms, the Naive Bayes classifier obtained an AUC
score of 94% for both fSet1 and fSet2. However, we obtained an AUC score of nearly 100% with fSet2 using
AdaBoost with a decision tree classifier as the base estimator, as shown in Figure 3. One reason may be that
the naive independence assumption of Gaussian Naive Bayes does not hold, since the counts of the different
parts of speech depend on each other.


Figure 4. Comparison of F1 score of algorithms on fSet2 with varying datasets


Figure 3 and Figure 4 show that fSet2 outperforms fSet1 on all the datasets, with the best performance
by AdaBoost followed by random forest. The performance of the algorithms differs across datasets,
although the SVM classifier performed similarly on all datasets, including the BuzzFeed dataset, as shown in
Figure 3. The SVM classifier was found to be the slowest classifier. We conducted an extensive study,
varying all parameters, to find the best hyperparameter values for the performance metric. Figures 1, 2, 3,
and 4 show the F1 and AUC scores for the different classifiers and the feature sets fSet1 and fSet2 using the
best hyperparameters. We also compared the performance of the algorithms with the best hyperparameters
against the default hyperparameters. The performance gain was limited for all algorithms except the random
forest and AdaBoost classifiers, which improved significantly: their AUC scores improved by 13% and 7%,
respectively.

The random forest classifier fits decision tree classifiers on subsamples of the feature set and
then ensembles the resulting trees to predict the class. AdaBoost is a boosting algorithm that is used with
weak classifiers. In our example, with default parameters (a decision tree classifier as the base estimator, a
learning rate of 1, and 50 estimators), it showed an AUC score of 90%. We increased the number of
estimators to 400 and improved the AUC score by 7%, leading to a 97% AUC score. We now present an
analysis of the datasets:

Imbalanced dataset (FNC-1): We observed that fSet2 outperforms fSet1, with a 97 ± 2% AUC score.
Even though the dataset is skewed, the performance of the ML algorithms is up to the mark for
both fSet1 and fSet2. Since fSet2 outperformed fSet1, we conclude that, although FNC-1 is imbalanced, the
frequency of the features (e.g., the number of unique words or the number of positive-sentiment words in
the news items) was sufficient for accurate classification. We observed that the performance of fSet1
(which includes the other linguistic features of the news items) is also significant for this dataset, because
the large number of news items (46,293 fake news + 3,678 real news) with repeated information lets the ML
algorithms capture the features of real and fake news. In a few cases we observed biased predictions due to
the imbalance (e.g., the model predicted fake news items with higher accuracy than real news items); this
example therefore illustrates a limitation of ML algorithms in avoiding automation of bias [31].

Limited-size datasets (PolitiFact and BuzzFeed): The classifiers produced lower AUC scores than on
the other two datasets. The AUC score with fSet2 is 74 ± 5% for BuzzFeed and 90 ± 10% for PolitiFact.
Results show that, even though the datasets are limited in size, the frequency of the features in the news is
significant enough for the same feature set fSet2 to outperform fSet1 on these datasets. The PolitiFact
dataset performs better even with fSet1, although it is limited in the number of real and fake news items.

Balanced dataset (Kaggle-Guardian): Compared with fSet1, feature set fSet2 improved the
performance for this dataset up to a 91 ± 14% AUC score. Results show that, even though the dataset is
balanced, the frequencies of the features in the dataset are comparable; therefore, the same feature set fSet2
outperformed fSet1 on this dataset, but with a lower AUC score than on FNC-1. This dataset is difficult for
classifiers (lower AUC score in comparison to others) even though it is balanced.


4. CONCLUSION 

A study of feature sets over four fake news datasets using ML algorithms concludes that fSet2 is an
effective reduced feature set compared with fSet1, since the random forest, AdaBoost, k-nearest neighbor,
and SVM classifiers obtained higher AUC scores for fSet2 than for fSet1. fSet2 is computed using the
covariance and correlation attribute evaluation metric. The four datasets under study had different
proportions of real and fake news items. Thus, the proposed approach has been tested on limited-size,
imbalanced, and balanced datasets. Fake news can also be written in regional languages used across the
globe to spread distrust among the local public. Detecting fake news in regional content is challenging,
since regional languages have different linguistic features and limited availability of datasets. Future work
is proposed on language features for regional languages.